Prediction on the rent of flats and houses in Delhi

Author

Nanmanat Disayakamonpan

In this analysis, I explore the delhi data frame to gain insights into housing prices in Delhi. I then develop predictive models for estimating the rent of both flats and houses in the city.

1 Data Overview and Preparation

The dataset contains many interesting variables, including housing prices, area, location, and amenities such as bedrooms, bathrooms, balconies, lifts, and parking. It also records property features such as new/resale status and furnished/unfurnished condition. Leveraging these variables, we can effectively discern and predict the variation in price ranges across properties in Delhi.

 [1] "X"                "price"            "Address"          "area"            
 [5] "latitude"         "longitude"        "Bedrooms"         "Bathrooms"       
 [9] "Balcony"          "Status"           "neworold"         "parking"         
[13] "Furnished_status" "Lift"             "Landmarks"        "type_of_building"
[17] "desc"             "Price_sqft"      

2 Pre-Processing

2.1 Units

Code
# 1. Units
delhi$area_sqm <- delhi$area * 0.0929  # square feet to square meters

conversion_factor <- 0.011  # approximate INR-to-EUR exchange rate
delhi$price_eur <- delhi$price * conversion_factor
delhi <- delhi %>%
  mutate(price_per_sqm = price_eur / area_sqm)

2.2 NAs

Code
# 2. NAs
missing_values <- colSums(is.na(delhi))

delhi <- delhi %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), 0, .)))

#sapply(delhi, function(x) is.numeric(x) && any(x == 0)) #to check which variable have 0
#colSums(is.na(delhi))

Pros and Cons of Replacing Missing Values with 0

Pros:

  • It helps in maintaining the structure of the dataset.

  • It allows us to perform calculations on the variables without encountering NA issues.

Cons:

  • It might introduce bias, especially if missing values were not truly zero.

  • It assumes that missing values mean absence rather than unknown or undefined.
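To illustrate that last point, a common alternative (not used in this analysis) is to impute the column median instead of 0, which avoids pulling the distribution toward zero. The snippet below is a minimal sketch on toy data:

```r
# Median imputation on toy data: the NA is replaced by the column
# median rather than 0, which avoids shifting the distribution toward zero.
x <- c(1, 2, NA, 4, 100)
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
x_imputed  # the NA becomes the median of the observed values, 3
```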

2.3 Log-Transformation

Code
# Select the numeric columns, excluding longitude and latitude: these
# are the variables that will be log-transformed.
cols_no_lonlat <- delhi |> 
  select(where(is.numeric), -c(longitude, latitude)) |> 
  names()

# First add 1 to every numeric column whose minimum is 0, so that the
# log transformation is defined everywhere; then log-transform all
# columns in cols_no_lonlat, storing the results in new columns
# suffixed with "_log".
delhi <- delhi |> 
  mutate(
    across(
      where(~is.numeric(.x) && min(.x) == 0),
      ~.x + 1)) |> 
  mutate(
    across(
      all_of(cols_no_lonlat),
      ~log(.x),
      .names = "{.col}_log"
    )
  )

2.4 Splitting the Data

Code
set.seed(123)  # for reproducibility
train_index <- sample(1:nrow(delhi), 0.7 * nrow(delhi))

# Create training and testing datasets
delhi_train <- delhi[train_index, ]
delhi_test <- delhi[-train_index, ]

Here I allocate 70% of the data to training and the rest to testing. An 80% split would give the model more information to learn from but would leave less data for testing and validation, while a smaller training set could lead to underfitting. A 70% training share therefore seems the best compromise for the split.

3 Exploratory Analysis

3.1 Mapping Space and Price per Square Meter

3.1.1 Original and Log-Transformed Price per Square Meter of Flats and Houses in Delhi

In this map, each data point is colored by its original price per square meter: darker red indicates higher prices and dark blue lower prices. The map gives a direct visualization of the spatial distribution of housing prices in Delhi and lets us quickly identify areas with higher or lower average prices. As the map shows, prices per square meter tend to be cheaper on the outskirts and more expensive close to the city center of Delhi.

In this map, colors are assigned based on the log-transformed price per square meter. Darker red and blue colors may still represent higher and lower prices, respectively, but now the scale is logarithmic. I used log transformation to handle skewed data and make it easier to visualize the relative differences in lower price ranges.

Generally, the second map may be more informative, enhancing the visibility of price ranges and highlighting relative differences across a wide range of prices in the city of Delhi. However, I prefer the original price per square meter because each data point is easier to interpret and it allows us to quickly identify areas with higher and lower average prices.
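The maps described above can be reproduced with a short ggplot2 sketch. The mini data frame below is a hypothetical stand-in for the real delhi columns (longitude, latitude, price_per_sqm):

```r
library(ggplot2)

# Hypothetical mini-sample standing in for the real delhi data frame
delhi_demo <- data.frame(
  longitude     = c(77.10, 77.22, 77.30),
  latitude      = c(28.55, 28.61, 28.70),
  price_per_sqm = c(450, 1200, 300)
)

# Map-style scatter: blue = cheap, red = expensive
p <- ggplot(delhi_demo, aes(longitude, latitude, colour = price_per_sqm)) +
  geom_point(size = 2) +
  scale_colour_gradient(low = "darkblue", high = "darkred") +
  labs(colour = "EUR per sqm") +
  theme_minimal()
```

For the log-scaled version, map `colour = log(price_per_sqm)` (or the pre-computed price_per_sqm_log column) instead.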

3.1.2 Original and Log-Transformed Area (in Square Meters) of Flats and Houses in Delhi

In this map, prominent clusters of dark blue colors across the city of Delhi indicate areas with generally smaller absolute sizes of flats and houses. However, the presence of some green and dark red data points signifies specific areas characterized by particularly spacious flats and houses, offering a nuanced view of the diverse housing sizes within the city.

Conversely, using the log-transformed area accentuates regions where the distribution of the original area in square meters was skewed. This transformation spreads out the values, providing a more insightful visualization and enhancing our ability to perceive relative size variations of flats and houses across the city. As the map above shows, the log-transformed area map effectively highlights that properties in the north of Delhi tend to be smaller, while larger flats and houses appear more often in the south of the city.

3.2 Categories and Price

When contemplating a property purchase, it is crucial to weigh various factors, including the choice between a flat or an individual house, opting for a new construction or a resale property, deciding on furnished or unfurnished spaces, and determining whether to go for a ready-to-move unit or one under construction.

To effectively analyze and illustrate the price disparities associated with these choices, I created a series of boxplots using the log-transformed price per square meter (price_per_sqm_log). The price-per-square-meter approach standardizes the comparison by relating price to property size, which helps in assessing cost efficiency in terms of the space we are getting for the price. Moreover, since the original price per square meter has a skewed distribution, I used the log transformation to better visualize the relative differences in price ranges.
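A minimal sketch of one such boxplot, on a hypothetical mini-sample (the real calls use delhi with the price_per_sqm_log column created in the pre-processing step):

```r
library(ggplot2)

# Hypothetical mini-sample standing in for the real delhi data frame
demo <- data.frame(
  type_of_building  = c("Flat", "Flat", "Individual House", "Individual House"),
  price_per_sqm_log = c(5.8, 6.4, 6.9, 7.3)
)

# One boxplot per building type on the log price-per-sqm scale
p <- ggplot(demo, aes(x = type_of_building, y = price_per_sqm_log)) +
  geom_boxplot() +
  labs(x = NULL, y = "log(price per sqm)") +
  theme_light()
```

Swapping the x aesthetic to neworold, Furnished_status, or Status yields the other three boxplots.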

3.2.1 Individual Houses and Flats

3.2.2 New or Resales Properties

3.2.3 Furnished or Unfurnished

3.2.4 Ready or Under-Construction Properties

According to the series of boxplots shown above, the best money-saving strategy is to buy a flat or house that is resale, unfurnished, and ready to move into.

3.3 Size and Price

3.3.1 Size and Price with Parking Availability

At this stage, I aim to explore the relationship between apartment size and total price, taking into account the influence of parking availability. I will depict this connection using two sets of variables: the original area and price data, as well as the log-transformed counterparts. By comparing these two visualizations, I intend to illustrate the differences and assess which representation provides more informative insights into the impact of parking availability on the size and price dynamics of apartments.

As the graphs of apartment size versus total price with parking and balcony availability show, the main difference between them is that the data points are more spread out with the log-transformed data than with the original. However, I find the first approach, with the original area and total price, more informative than the log transformation, because we can interpret the colors and relate them to our real-world perception of high and low prices.
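The scatter plots discussed here follow this pattern; the toy numbers below stand in for the real area_sqm, price_eur, and parking columns:

```r
library(ggplot2)

# Toy stand-in for the real delhi columns
demo <- data.frame(
  area_sqm  = c(45, 80, 120, 200),
  price_eur = c(30000, 60000, 110000, 210000),
  parking   = c(0, 1, 1, 2)
)

# Size vs. price, colored by parking availability
p <- ggplot(demo, aes(x = area_sqm, y = price_eur, colour = factor(parking))) +
  geom_point(size = 2) +
  labs(colour = "Parking spaces") +
  theme_light()
```

The log version plots area_sqm_log against price_eur_log with the same colour mapping.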

3.3.2 Size and Price with Balcony Availability

4 Preliminaries and Hypothesis Testing

4.1 Splitting the Data and Setting the Algorithm

Before moving on to hypothesis testing, I will conduct a preliminary stage by splitting the data set into 60% for training data and the rest for testing data.

Code
set.seed(0421)

data_split <- initial_split(delhi, prop = 0.6) 

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

Regression_OLS <- linear_reg() |>
  set_mode("regression") |> # Machine learning: regression or classification
  set_engine("lm")

4.2 Two Variables and Two Hypotheses

4.2.1 An Overview of the Correlations in the Data

To start, I create a correlation matrix to get an overview of the correlations in the data and to see which variables are highly correlated with the total price. As the output shows, the total price is highly correlated with area_sqm, Bathrooms, and Bedrooms.
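Such a matrix can be computed with base R's cor(); the data frame below is a toy stand-in for the numeric columns of delhi:

```r
# Toy stand-in for the numeric columns of the delhi data frame
demo <- data.frame(
  price_eur = c(33000, 66000, 121000, 231000),
  area_sqm  = c(45, 80, 120, 200),
  Bedrooms  = c(1, 2, 3, 4),
  parking   = c(0, 1, 1, 2)
)

# Correlation of every numeric variable with the total price
cor_with_price <- cor(demo, use = "pairwise.complete.obs")[, "price_eur"]
round(cor_with_price, 2)
```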

4.2.2 Two Predictor Variables Selection and Two Hypotheses

Based on the exploratory analysis, I select two variables, area_sqm (area in square meters) and parking, which I expect to have a substantial influence on housing prices in Delhi.

Next, I will formulate 2 hypotheses about these variables as follows:

Hypothesis 1:

  • Null Hypothesis (H0): There is no significant relationship between the size of the property (area_sqm) and housing prices.

  • Alternative Hypothesis (H1): There is a significant positive/negative relationship between the size of the property (area_sqm) and housing prices.

  • Explanation: If the null hypothesis is rejected, it suggests that the size of the property has a significant impact on housing prices. The direction of the relationship (positive/negative) will indicate whether larger properties tend to have higher or lower prices.

Hypothesis 2:

  • Null Hypothesis (H0): There is no significant relationship between parking and housing prices.

  • Alternative Hypothesis (H1): There is a significant relationship between parking and housing prices.

  • Explanation: If the null hypothesis is rejected, it implies that parking has a significant effect on housing prices. This could mean that an increase or decrease in the number of parking spaces is associated with a corresponding increase or decrease in housing prices.

4.3 Pre-processing steps

4.3.1 Identify Missing Values

In the original delhi dataset, two variables, Balcony and parking, have missing values; I already identified and handled those in a previous step. As a first pre-processing step before developing predictive models, it is still necessary to check for missing values in the response variable price_eur and in the two predictor variables area_sqm and parking, because missing values in these variables can affect the performance of my predictive model, and I want to ensure the overall quality of the dataset so the analysis is not biased or incomplete.

Code
# 1. check if these response/predictor variables have any missing values
summary(is.na(delhi$price_eur))
   Mode   FALSE 
logical    7738 
Code
summary(is.na(delhi$area_sqm)) 
   Mode   FALSE 
logical    7738 
Code
summary(is.na(delhi$parking))
   Mode   FALSE 
logical    7738 

4.3.2 Log-Transformation for Skewed Data

Another pre-processing step I used is log transformation, because I noticed that the response variable price_eur and the predictor variable area_sqm have skewed distributions, which can hurt model performance. Since I want to mitigate the impact of extreme values and make the distributions more symmetric, I applied a log transformation to these variables to improve the performance of models that benefit from more balanced data.

Code
# 2. check if these response/predictor variables have normal or skewed distribution
hist(delhi$price_eur) #skewed distribution

Code
hist(delhi$price_eur_log) # normal distribution

Code
hist(delhi$area_sqm) #skewed distribution

Code
hist(delhi$area_sqm_log) # normal distribution

Code
suppressWarnings({
  train_data %>% 
    ggplot(aes(x = area_sqm, y = price_eur)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
    theme_light()
})

Code
suppressWarnings({
  train_data %>% 
    ggplot(aes(x = area_sqm_log, y = price_eur_log)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ x) + 
    theme_light()
})

As mentioned, addressing missing values and skewed distributions in the pre-processing step is crucial to build and increase the robustness and accuracy of the predictive models.

4.4 Fitting the Model (Intercept-Only Model)

The next step is to fit the model, using the training data to estimate its parameters, such as the intercept and the coefficients of each predictor variable.

Code
TM0 <- fit(Regression_OLS, price_eur ~ 1, data = train_data)
#glance(TM0)
rmse(as.data.frame(cbind(train_data$price_eur, TM0$fit$fitted.values)), V1, V2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      79811.

As you can see, the RMSE value of 79811.5 is the root of the mean squared difference between the actual prices (price_eur) and the prices predicted by the model. Roughly speaking, the model's predictions are off by about 79,811.5 units of the currency used for the prices on a typical observation.
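As a sanity check on the metric itself, RMSE can be computed by hand: for an intercept-only model, the prediction for every observation is just the training mean (toy numbers below):

```r
# RMSE by hand for an intercept-only model: every prediction is the
# mean of the response.
y     <- c(30000, 60000, 110000, 210000)
y_hat <- rep(mean(y), length(y))
rmse_manual <- sqrt(mean((y - y_hat)^2))
rmse_manual
```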

4.5 Model Development

At this stage, I selected the two variables area_sqm and parking as predictors that I think may influence housing prices. I then developed several predictive models to see which performs best for predicting housing prices in Delhi.

4.5.1 Model 1: TM1 Model

Code
TM1 <- fit(Regression_OLS, price_eur ~ area_sqm, data = train_data)
tidy(TM1)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -41649.   1356.       -30.7 9.37e-189
2 area_sqm       1014.      9.16     111.  0        
Code
glance(TM1)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic p.value    df  logLik     AIC     BIC
      <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1     0.725         0.725 41828.    12260.       0     1 -55983. 111971. 111991.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model 1 (TM1):

  1. Coefficients: The coefficient for area_sqm is 1014.41. This suggests that for each additional square meter increase in the area, the predicted price_eur increases by 1014.41 euros.

  2. Standard Errors: The standard error for area_sqm is 9.16.

  3. Significance: Both intercept and area_sqm have very small p-values (close to zero), indicating that they are statistically significant.

  4. Goodness of Fit: The R-squared value is 0.7255. This means that approximately 72.55% of the variability in price_eur is explained by the model. The R-squared value is relatively high, suggesting that the model explains a substantial portion of the variability in price_eur.

4.5.2 Model 2: TM2 Model

Code
TM2 <- fit(Regression_OLS, price_eur ~ parking, data = train_data)
tidy(TM2)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  92355.     1178.      78.4    0    
2 parking        -33.8      31.1     -1.09   0.277
Code
glance(TM2)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic p.value    df  logLik     AIC     BIC
      <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1  0.000255     0.0000391 79819.      1.18   0.277     1 -58982. 117971. 117990.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model 2 (TM2):

  1. Coefficients: The coefficient for parking is -33.77. This suggests that, holding other variables constant, each unit increase in the parking variable is associated with a decrease in the predicted price_eur by 33.77 euros.

  2. Standard Errors: The standard error for parking is 31.07.

  3. Significance: The p-value for parking is 0.2772, which is larger than the typical significance level of 0.05. This suggests that the variable parking may not be statistically significant in predicting price_eur.

  4. Goodness of Fit: The R-squared value is 0.0002545. This indicates that the model explains a very small proportion of the variability in price_eur. In other words, the predictor variable parking does not contribute much to explaining the variation in the response variable or the model has very low explanatory power.

4.5.3 Model 3: TM3 Model

In real estate, the number of bedrooms and bathrooms are fundamental features that influence housing prices. On the other hand, the presence of a balcony is an additional feature that can impact the overall price of a property. Moreover, based on the correlation matrix, Bedrooms and Bathrooms are highly correlated with price_eur (total price), indicating a strong relationship.

As mentioned, I selected Bathrooms, Bedrooms, and Balcony as control variables and added them to the following models to see whether these predictive models perform better than the previous ones.

Code
TM3_Model <- recipe(
  price_eur ~ area_sqm + Bathrooms + Bedrooms + Balcony,
  data = train_data
)
summary(TM3_Model)
# A tibble: 5 × 4
  variable  type      role      source  
  <chr>     <list>    <chr>     <chr>   
1 area_sqm  <chr [2]> predictor original
2 Bathrooms <chr [2]> predictor original
3 Bedrooms  <chr [2]> predictor original
4 Balcony   <chr [2]> predictor original
5 price_eur <chr [2]> outcome   original
Code
TM3 <- fit(Regression_OLS, price_eur ~ area_sqm + Bathrooms + Bedrooms + Balcony, data = train_data)
tidy(TM3)
# A tibble: 5 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -50897.    2308.    -22.0   1.51e-102
2 area_sqm        921.      15.4    59.6   0        
3 Bathrooms     10238.    1280.      8.00  1.62e- 15
4 Bedrooms       -578.    1111.     -0.520 6.03e-  1
5 Balcony        -939.     433.     -2.17  3.02e-  2
Code
glance(TM3)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic p.value    df  logLik     AIC     BIC
      <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1     0.730         0.730 41484.     3136.       0     4 -55943. 111898. 111937.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model 3 (TM3):

  1. Coefficients:
  • The coefficient for area_sqm is 920.56: for each additional square meter of area, the estimated price increases by 920.56 euros.

  • The coefficient for Bathrooms is 10,237.59, suggesting that each additional bathroom is associated with a price increase of 10,237.59 euros.

  • The coefficient for Bedrooms is -578.10, suggesting that each additional bedroom is associated with a price decrease of 578.10 euros. This result might seem strange, and I need to consider interactions with other variables.

  • The coefficient for Balcony is -939.48, suggesting that each additional balcony is associated with a price decrease of 939.48 euros.

  2. Standard Errors: All the standard errors seem reasonable, suggesting that the coefficient estimates are likely reliable.
  • The standard error for area_sqm is 15.43.

  • The standard error for Bathrooms is 1280.45.

  • The standard error for Bedrooms is 1111.33.

  • The standard error for Balcony is 433.44.

  3. Significance:
  • area_sqm and Bathrooms: These variables have very low p-values (close to zero), indicating that they are statistically significant predictors of the price.

  • Bedrooms and Balcony: Bedrooms have a p-value greater than 0.05, suggesting it might not be a statistically significant predictor. Balcony, with a p-value of 0.03, is statistically significant at the 0.05 level.

  4. Goodness of Fit: The R-squared value of 0.7301 indicates that the model explains approximately 73.01% of the variability in the response variable (price). This suggests a reasonably good fit.

In summary, this model suggests that area_sqm and Bathrooms are strong predictors of the housing price, while Bedrooms and Balcony may have less impact.

4.5.4 Model 4: TM4 Model

Code
TM4 <- fit(Regression_OLS, price_eur ~ parking + Bathrooms + Bedrooms + Balcony, data = train_data)
tidy(TM4)
# A tibble: 5 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) -91536.      2933.    -31.2   3.25e-194
2 parking         -7.73      21.5    -0.360 7.19e-  1
3 Bathrooms    51025.      1439.     35.5   7.03e-244
4 Bedrooms     17908.      1419.     12.6   6.24e- 36
5 Balcony       2688.       571.      4.71  2.58e-  6
Code
glance(TM4)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic p.value    df  logLik     AIC     BIC
      <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1     0.523         0.523 55147.     1271.       0     4 -57264. 114541. 114580.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model 4 (TM4):

  1. Coefficients:
  • The coefficient for parking is -7.73: each additional parking space is associated with a price decrease of 7.73 euros.

  • The coefficient for Bathrooms is 51,025.25, suggesting that each additional bathroom is associated with a price increase of 51,025.25 euros.

  • The coefficient for Bedrooms is 17,908.05, suggesting that each additional bedroom is associated with a price increase of 17,908.05 euros.

  • The coefficient for Balcony is 2,687.60, suggesting that each additional balcony is associated with a price increase of 2,687.60 euros.

  2. Standard Errors: All the standard errors seem reasonable, suggesting that the coefficient estimates are likely reliable.
  • The standard error for parking is 21.48.

  • The standard error for Bathrooms is 1439.08.

  • The standard error for Bedrooms is 1418.87.

  • The standard error for Balcony is 570.87.

  3. Significance:
  • The parking variable has a p-value greater than 0.05, suggesting it might not be a statistically significant predictor.

  • However, all other variables have very low p-values (close to zero), indicating that they are statistically significant predictors of the price.

  4. Goodness of Fit: The R-squared value of 0.5231 indicates that the model explains approximately 52.31% of the variability in the response variable (price). This suggests a moderate fit.

In summary, this model suggests that Bathrooms, Bedrooms, and Balcony are statistically significant predictors of housing price. However, parking does not appear to be a significant predictor in this model.

4.6 Model with Pre-processing Steps: PCA or Log-Transformation

From my perspective, I want to develop models with both pre-processing approaches (PCA and log-transformation) to see the differences between them.

Code
#names(train_data) # get a list of variable names to speed this up

TM3ln_Model <- recipe(price_eur_log ~ ., # we need to select all variables first for pre-processing to work
                      data = train_data) %>% # same as above
  step_log(area_sqm, offset =  1, base = 10) %>% # log-transform predictor
  step_log(Bathrooms, offset =  1, base = 10) %>% 
  step_log(Bedrooms, offset =  1, base = 10) %>% 
  step_log(Balcony, offset =  1, base = 10) %>% 
  step_rm("price_eur", # remove the old response / dependent variable
          "X", "price", "Address", "area", "latitude", "longitude", # remove all predictors not to be included
          "Status", "neworold", "parking", "Furnished_status", "Lift", 
          "Landmarks", "type_of_building", "desc", "Price_sqft", "price_per_sqm",
          "X_log","price_log", "area_log", "Bedrooms_log", "Bathrooms_log", "Balcony_log",
          "parking_log", "Lift_log", "Price_sqft_log", "area_sqm_log", "price_per_sqm_log") 
summary(TM3ln_Model) 
# A tibble: 33 × 4
   variable  type      role      source  
   <chr>     <list>    <chr>     <chr>   
 1 X         <chr [2]> predictor original
 2 price     <chr [2]> predictor original
 3 Address   <chr [3]> predictor original
 4 area      <chr [2]> predictor original
 5 latitude  <chr [2]> predictor original
 6 longitude <chr [2]> predictor original
 7 Bedrooms  <chr [2]> predictor original
 8 Bathrooms <chr [2]> predictor original
 9 Balcony   <chr [2]> predictor original
10 Status    <chr [3]> predictor original
# ℹ 23 more rows
Code
TM3ln_Model_prep <- prep(TM3ln_Model, training = train_data)

PCA_Model <- recipe(price_eur ~., data = train_data) |>  # Full Model
  step_rm(price_eur_log, # remove alternative response
          "X", "price", "Address", "area", "latitude", "longitude", # remove the factors & other predictors
          "Status", "neworold", "Furnished_status", "Landmarks", 
          "type_of_building", "desc", "Price_sqft", "price_per_sqm","X_log",
          "price_log", "area_log", "Bedrooms_log", "Bathrooms_log", "Balcony_log",
          "parking_log", "Lift_log", "Price_sqft_log", "area_sqm_log", "price_per_sqm_log") |>
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(c("area_sqm", "Bedrooms", "Bathrooms"), num_comp = 1, prefix = "PC1main") %>%
  step_pca(c("Balcony", "parking", "Lift"), num_comp = 1, prefix = "PC2additional") 
summary(PCA_Model)
# A tibble: 33 × 4
   variable  type      role      source  
   <chr>     <list>    <chr>     <chr>   
 1 X         <chr [2]> predictor original
 2 price     <chr [2]> predictor original
 3 Address   <chr [3]> predictor original
 4 area      <chr [2]> predictor original
 5 latitude  <chr [2]> predictor original
 6 longitude <chr [2]> predictor original
 7 Bedrooms  <chr [2]> predictor original
 8 Bathrooms <chr [2]> predictor original
 9 Balcony   <chr [2]> predictor original
10 Status    <chr [3]> predictor original
# ℹ 23 more rows
Code
# Execute the preprocessing (optional)
PCA_Model_prep <- prep(PCA_Model, training = train_data)

Considering the characteristics of the original dataset, I prefer employing log-transformation as a pre-processing step for the model. This choice is suitable for the observed skewed distribution in both response and predictor variables. Log-transformation is beneficial in this context as it helps create more symmetric distributions, thereby enhancing model performance. Additionally, Principal Component Analysis (PCA) is a suitable approach when dealing with high-dimensional data, but since the current models do not involve a large number of predictors, the need for dimensionality reduction is not prominent in this case.

4.7 Building the Workflows

Code
TM3_ols_wf <- workflow() %>%
  add_recipe(TM3_Model) %>%  # terminology: the recipe is the model
  add_model(Regression_OLS)  # terminology: this is the algorithm we are adding

TM3ln_ols_wf <- workflow() %>%
  add_recipe(TM3ln_Model) %>%
  add_model(Regression_OLS)

PCA_ols_wf <- workflow() %>%
  add_recipe(PCA_Model) %>%
  add_model(Regression_OLS)

4.8 Training the Models

Code
TM3_ols_fit <- fit(TM3_ols_wf, data = train_data)

TM3ln_ols_fit <- fit(TM3ln_ols_wf, data = train_data)

PCA_ols_fit <- fit(PCA_ols_wf, data = train_data)

4.9 An Overview of the Model Results

4.9.1 TM3 Model Result

Code
# TM3_ols_fit
rmse(as.data.frame(cbind(train_data$price_eur, TM3_ols_fit$fit$fit$fit$fitted.values)),V1, V2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      41462.

4.9.2 TM3ln Model Result

Code
# TM3ln Model on original price
rmse(as.data.frame(cbind(train_data$price_eur, 
                         exp(TM3ln_ols_fit$fit$fit$fit$fitted.values))),V1, V2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      44242.

4.9.3 PCA Model Result

Code
# PCA_ols_fit
rmse(as.data.frame(cbind(train_data$price_eur, PCA_ols_fit$fit$fit$fit$fitted.values)),V1, V2)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      48026.

4.9.4 Coefficient and Overall Model Output

The coefficients, summarized below, provide insight into how each predictor influences the response in each model:

Models: (1) TM3, (2) TM3ln (log response), (3) PCA.

                      (1)             (2)            (3)
(Intercept)    −50896.544***      5.290***     92224.526***
                (2308.239)        (0.063)       (705.128)
area_sqm          920.563***      2.441***
                  (15.434)        (0.046)
Bathrooms       10237.589***      0.917***
                (1280.451)        (0.091)
Bedrooms         −578.095         0.601***
                (1111.340)        (0.086)
Balcony          −939.479*        0.000
                 (433.446)        (0.027)
PC1main1                                       39923.851***
                                                (451.849)
PC2additional1                                  3150.901***
                                                (622.467)
Num.Obs.         4642             4642          4642
R2                  0.730            0.724         0.638
BIC            111936.5          2803.5       113284.1

An overview of the model results indicates that:

  • TM3 model has the lowest RMSE, indicating better predictive performance.

  • The R-squared values suggest that TM3 model has the highest goodness of fit among the models.

In conclusion, based on the RMSE and R-squared values, the TM3 Model appears to be the best-performing model for predicting housing prices in Delhi.

(to be continued)